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Abstract 

Background: In this paper we deal with modeling serum proteolysis process from tandem mass spectrometry 
data. The parameters of peptide degradation process inferred from LC-MS/MS data correspond directly to the 
activity of specific enzymes present in the serum samples of patients and healthy donors. Our approach integrate 
the existing knowledge about peptidases' activity stored in MEROPS database with the efficient procedure for 
estimation the model parameters. 

Results: Taking into account the inherent stochasticity of the process, the proteolytic activity is modeled with the 
use of Chemical Master Equation (CME). Assuming the stationarity of the Markov process we calculate the 
expected values of digested peptides in the model. The parameters are fitted to minimize the discrepancy 
between those expected values and the peptide activities observed in the MS data. Constrained optimization 
problem is solved by Levenberg-Marquadt algorithm. 

Conclusions: Our results demonstrates the feasibility and potential of high-level analysis for LC-MS proteomic data. 
The estimated enzyme activities give insights into the molecular pathology of colorectal cancer. Moreover the 
developed framework is general and can be applied to study proteolytic activity in different systems. 



Background 

Motivation and related research 

Recent advances in high throughput technologies, which 
evaluate tens of thousands of genes or proteins in a single 
experiment, are providing new methods for identifying 
biochemical determinants of the disease process. One of 
the experimental technologies allowing us to study mole- 
cular basis underlying specific disease phenotype is mass 
spectrometry (MS) [1,2]. Observed large variability in 
mass spectrometry images of blood samples was attributed 
to ex vivo proteolysis. 

Paradoxically, one can take advantage of these findings 
in cancer diagnostics [3,4] as diagnostic peptides originate 
after ex vivo proteolytic processing of high abundance 
protein fragments. 
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As development in hardware and software progresses, 
we can obtain better and better estimates of peptide con- 
centrations in body fluids, which give many insights into 
the peptide degradation process. Proteolysis modeled in 
this paper is the process in which a protein is broken 
down partially, into peptides, or completely, into amino 
acids, by proteolytic enzymes present in blood serum. 
Among proteolytic enzymes two main groups are distin- 
guished. One group includes exopeptidases which require 
a free N-terminal amino group, C-terminal amino group 
or both and cut the peptide not more than three amino 
acids from the terminus. Enzymes belonging to the sec- 
ond group are called endopeptidases and they tend to 
cleave away from the end of the peptide. 

Our results 

In this paper we present formal mathematical model 
describing serum proteolysis dynamics. We focus here 
on the activity of peptide cutting enzymes (peptidases). 



o 
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The model parameters are inferred from liquid chroma- 
tography tandem mass spectrometry data (LC-MS/MS). 

The dynamical changes in peptide composition caused 
by proteolytic degradation are described by means of bio- 
chemical reactions network. It corresponds to Markov 
process whose evolution is governed by the system of 
stochastic differential equations (i.e. Chemical Master 
Equation). 

The current approach significantly extends the exopep- 
tidase activity model presented in [5]. The integration 
with peptidase database MEROPS (http://merops.sanger. 
ac.uk) [6,7] allows for modeling the endopeptidase activ- 
ity as well. Moreover the model parameters inferred from 
MS data correspond directly to specific proteolytic 
enzymes present in each sample. On the other hand tak- 
ing the splitting reaction (coming from endopeptidase 
cleavage A — > B + C) into account significantly compli- 
cates the mathematical description. There is no analytical 
solution of the CME as in [5]. Instead we calculate the 
expected amount of peptides in the stationary state of the 
process. Those values are compared to the MS readouts 
for the corresponding peptides. The model parameters 
are calculated to minimize the discrepancy between 
expected and observed amount of each peptide. 

Organization of the paper 

We start by description of our model presented with the 
use of so called cleavage graph, then we present computa- 
tional methods to interpret MS data (to fill the graph with 
appropriate readouts values) and to infer the model para- 
meters. The constrained optimization problem is formu- 
lated and solved with Levenberg-Marquadt [8] algorithm. 
We estimate the convergence and statistical significance of 
estimation procedure outcomes. Finally, identified active 
peptidases for both groups of healthy donors and colorec- 
tal cancer patients are presented. A preliminary version of 
this paper was presented at 1st IEEE International Confer- 
ence on Computational Advances in Bio and medical 
Sciences (ICCABS 2011). In this extended version we have 
significantly modified the Results section, by updating the 
MEROPS data, fixing some errors in scripts for data pre- 
processing and presenting more statistical analyses. 

Model of proteolysis process 

To illustrate the process of peptide degradation we 
introduce the cleavage graph, whose vertices correspond 
to peptides and proteolytic events. More formally, con- 
sider the bipartite digraph, (V U W, £), where the first 
set V {peptide nodes) corresponds to all subsequences 
of the peptides considered and the second set of event 
nodes W corresponds to all possible proteolytic events. 

By proteolytic event we mean the cleavage of a speci- 
fic substrate at specific site made by a specific peptidase. 



Hence each event node is labelled by a peptidase, and 
has one ingoing edge and two outgoing edges (leading 
to peptide prefix and suffix obtained by cutting the sub- 
strate at a single site). 

Now we visualize the peptide subsequences as parti- 
cles placed at peptide nodes of the cleavage graph. The 
particles are flowing through the edges of the graph 
according to the Petri net operational semantics, i.e. the 
transition (event node) consumes one substrate particle, 
and produces two particles. To assure the stationarity of 
the system we allow for creation and degradation of par- 
ticle at any node. We also add the source and the sink 
in the graph modeling the creation of precursor peptides 
(e.g. caused by the activity of some endopeptidases, 
which is not captured by our model) and complete 
degradation of short peptides. The cleavage graph is 
constructed for every processed MS sample. The peptide 
nodes are appropriately filled with mass spectrometry 
readouts and specific enzymes are assigned to event 
nodes according to data about real cleavage events (see 
the next section for details). 

A small exemplary fragment of the cleavage graph is 
depicted in Figure 1 five proteolytic events which engage 
four peptidases are presented. For u, v , w eV we use 
the notation u = v\w when peptides v and w can be 
obtained directly by cutting u (v is a non-empty strict 
prefix and w is a non-empty strict suffix of u). 

The operation t can be viewed as string concatena- 
tion. To identify a cleavage site we write simply v\w. 
Denote by V the set of all peptidases whose activity is 
modeled. Coefficients p% w (for peptidase p e V and 
cleavage vf w) put near the event nodes in Figure 1 cor- 
respond to the affinity between the peptidase cleavage 
pattern and the cleavage site composition (we call them 
affinity coefficients). They are defined for every possible 
pair of cleavage v\w and peptidase p and calculated at 
the graph construction stage. We assume that the clea- 
vage process has reached the equilibrium. Then for 
every peptide node v eV the following balance equa- 
tion [9] holds: 

(Pi + E XuPvwh + E x up r qv K = Xv{<pt~ + E Ply^t) (1) 

u=v\w u=q\v v=x\y 

where (p* is an activity of creation the sequence 
represented by v, (p„ is a degradation activity, x u and x v 
are expected amounts of peptides u and v, p p xy is an affi- 
nity coefficient and k p is the activity of cleaving by the 
peptidase p engaged in the cleavage xfy. Left hand side 
of the equation above refers to the sum of particles that 
flow into the node v, while right hand side to the sum 
of particles that flow out from this node. From the bal- 
ance equation above one can easily calculate the 
expected number of particles in every peptide node x v . 
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Methods 

In this section we describe the process of cleavage graph 
construction. It has several phases: firstly the set of 
nodes are determined. Peptide nodes correspond to the 
sequences identified in tandem MS experiment, while 
event nodes are selected carefully according to the 
knowledge from MEROPS database (version 9.4.). Dur- 
ing the second stage the graph should be filled with 
appropriate readouts from LC-MS spectra. To this aim 
we have to determine which signal in two-dimensional 
spectral map corresponds to a given peptide sequence 
(i.e. node in the graph) and to assign to this node the 
number of particles reflecting the signal strength. 

Having the cleavage graph we solve the constrained 
optimization problem to infer the unknown enzyme 
activity coefficients which minimize the discrepancy 
between expected number of peptides (calculated 
according to the model) and the observed signals in MS 
samples. 



Cleavage graph construction 

Let us define the set of amino acids together with the 
space letter X = {A, C, D, . . .} U {— } and the set of loci 
surrounding (4 from both sides) the cleavage site 
J = {P4,...,P1,P1 / ,...,P4 / }. 

For each peptidase p e V we construct (based on data 
collected in MEROPS database) the frequency matrix 

F p = {FP)ij for i e 1, ) ej. The value j* is the fre- 
quency of amino acid i on position j in all cleavage 
events, in which p is involved. Exemplary frequency 
matrices for elastase and trypsin enzymes are presented 
in Figure 2. 

Using frequency matrix we construct so called 
sequence logo [10] to represent the consensus sequence 
surrounding given cleavage site. Then for any peptidase 
p e V we detect the cleavage event cutting given pep- 
tide sequence if it matches the consensus sequence well. 
For more detailed description see Web Supplement 
(http://bioputer.mimuw.edu.pl/papers/proteolysis/). 
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Figure 2 Frequency matrix for elastase-2 and trypsin-1. Frequency matrices for elastase-2 (left) and trypsin-1 (right) based on MEROPS 
database [6]. Color scales are different for each matrix. 
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Affinity coefficients 

Let us consider cleavage v\w made by peptidase p e V . 
Assume, that the cleavage point is surrounded by the 
sequence of amino acids (possibly with some empty 
positions at the ends) apk...ap\(ipv ...apy where 
1 < fe, / < 4. 

We define J' = {Pfc, . . . , PI, PV ', . . . , Pi'} c J . Then 
we calculate the coefficients pl w as follows (y is the 
normalization constant): 

/4 = y(niF (2) 

jej> 

Filling the graph with LC-MS readouts 

MS samples were acquired from the blood serum of 20 
colorectal cancer patients and 19 healthy donors. Each 
sample was digested by trypsin before LC-MS proces- 
sing. Having so called precursor peptides (which were 
sequenced by tandem MS), we determine all their subse- 
quences that can be observed during the cleavage pro- 
cess; they form the peptide nodes of the cleavage graph. 
To this end we digest the precursor peptides in silico 
using the set of human enzymes from MEROPS. 

Then we look for corresponding signals in MS spectra 
as follows: using mz2m tool [11] we obtain a list of 
mono-isotopic peak coordinates (m/z, retention time 
and charge) together with their intensities. The search is 
processed for each sequence charged by each of eight 
possible charges (values from 1 to 8) detected by FTICR 



spectrometer. Expected value of retention time for each 
peptide is predicted from its amino acid composition by 
linear regression model [12] (trained by the set of 
known retention times for precursor peptides). Then we 
look for MS signals that are closest to expected ones 
(nearest neighbor classifier) and their distances do not 
exceed given threshold. Notice, that we ignore LC-MS 
signals corresponding to peptides not sequenced by tan- 
dem MS and theirs degraded forms. Therefore we use 
only partial information about degradation scheme, and 
possibly the activity of some enzymes engaged in the 
process cannot be inferred. By applying this procedure 
we fill many peptide nodes by appropriate peptide 
amount. However a large percent of the nodes remain 
empty. Finally, we prune the cleavage graph by remov- 
ing recursively empty sources and empty leaves. 

Constrained optimization 

Let us denote by S > C C V respectively, sets of sources 
(i.e. nodes without ingoing edges) and leaves (nodes 
without outgoing edges) in the cleavage graph. Let 
n = \S\ + \C\ + \V\ and z = («) we <s, («J- W (^W) e K n 
denote the vector of model parameters to be inferred. 
We are mainly interested in estimation of the para- 
meters (Xp)p e -p which describes activities of peptidases. 
We define O = (4> v )veV recursively for all y e V sorted 
topologically: <p* 
1. if v € S then W z ) = £ ^ v 
v=xty 
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and 



then 



u=v\w 



<t>v{ z ) 



3.ifv££ then 



E Ply^t 
v=xty 

E < 

u=v\w 



»u(z)/?L^p+ E <t>u{z)p r q v K 



u=q\v 



The graph pruning grants that S nC=fl, and the 
topological ordering assures that the function v is well- 
defined. 

Denote by y t the amount of peptide sequences identi- 
fied in LC-MS/MS experiment in the j-th peptide node, 
and by O c V set of vertices with y x > 0, \0\ = m. We 
solve non-linear least squares problem with objective 
function *I> = (\ls v ) ve o defined for each y e O by the 
formula ^ v (z) = (j) v (z) — y Vl which is well formulated and 
solvable for m > n (fortunately holding in our case for 
all investigated MS samples). We applied Levenberg- 
Marquadt algorithm (LMA) [13] to find optimal config- 
uration of model parameters. 

Compositional data 

To make the outcome of estimation procedure compar- 
able across different MS samples we normalized the vec- 
tor of parameters corresponding to peptidases' activities. 
Notice, that normalization does not change the value of 
function (p v for any y e Q . The normalized activities, 
say vector x, lies on the simplex, therefore we should 
apply the appropriate transformation (called centered 
log ratio clr{x)) to deal with them in Euclidean space 
(see the theory of compositional data analysis [14] for 
details). Let g(x) denotes the geometric mean for vector 
x, centered log ratio is defined as follows: 



clr(x) = ( In — — 

SM/i-l m 



Results 

The optimization procedure was applied to infer 
the enzymatic activity for 39 LC-MS samples, i.e. 
for each sample we obtained optimal parameters 

Z = ({<Pu)ueS, {$w)weCr &p)pev)- 

We run LMA for each data set 7 times (each time 
from different starting point) and use the maximal num- 
ber of iterations set up to 200 as a stop criterion. To 
measure the quality of estimation we use relative 
squared errors (rse) [15], i.e.: 



rse(z) 



Eieo in - nf 

where y { = 0 f (£) and y { = ± J2ieoYi< 
Adequacy of the model 

Aiming in justifying the adequacy of the proposed 
model we made the following experiment. The estima- 
tion procedure was run to obtain the expected number 
of peptide sequences y v in every peptide node y e O . 
Then we have filled the cleavage graph with synthetic 
data y* generated independently from normal distribu- 
tions with mean y v and standard deviation a e {0.1, 
0.01, 0.001} (with additional constraint y£ > 0). In this 
way we obtained three synthetic data sets, which are in 
the good, moderate and weak accordance with our 
model. As we expect, the quality of estimation proce- 
dure reflects the discrepancy with the model. Figure 3 
compares median relative squared error for real data 
and three data sets generated from the model with dif- 
ferent level of discrepancy. 

Statistical significance of estimation quality 

Optimization procedure yielded rather small rse errors 
for most samples. However, we were interested how the 
final relative squared error depends on the input data, 



real 

std = 0.1 
std = 0.01 
std = 0.001 




real 

std = 0.1 
std = 0.01 
std = 0.001 




Figure 3 Comparison of median value of relative squared error. Comparison of median value of relative squared errors for real data and 
synthetic data generated according to the model (plot for the sample no. 5 on the left and for the sample no. 19 on the right). 
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and whether results obtained by us are statistically sig- 
nificant. To answer this question for each MS sample x 
we have generated vector £ of 1000 randomly permuted 
variants (i.e. the topology of cleavage graph remained 
the same, while the amounts of peptides assigned to 
nodes were permuted). Then we run the optimization 
procedure for all data set to obtain the experimental dis- 
tribution of rse values. Now, we can define p-value for 
rse(x) as follows: 



p - val(rse(x)) = 



{i : rse(ff) < rse(x)} 
1000 



(3) 



Table 1 presents p-values and corresponding rse 
values for all analysed MS samples. 

Biological significance of inferred enzymes 

Figure 4A and Figure 5A present the inferred peptidases' 
activities for samples no. 5 (healthy donor) and 19 (col- 
orectal cancer patient). Subfigures B-D illustrate the 
accuracy of estimation procedure for synthetic data sets 
for which the estimated parameters are known and 
equal to those inferred from real data (red line). We 
observe once more, that the quality of estimation corre- 
lates well with the discrepancy of data (for smallest stan- 
dard deviation, i.e. Figure 4D and Figure 5D, the 
estimated parameters are close to real ones). Analogous 
results are obtained for other analyzed samples. 

The set of identified enzymes do not vary significantly 
between all investigated samples: there are 6 peptidases 



Table 1 Final relative squared errors (rse) and p-values 
(calculated from rse distribution). 
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identified in all samples and 19 peptidases found in at 
least one sample (listed in Figure 6). For further analysis 
we selected 37 samples (19 healthy, 18 diseased), 
for which acceptable estimates had been obtained (c.f. 
Table 1). We excluded two colorectal cancer samples 
(no. 17 and 39) as they perform significantly different 
than others (one required much higher threshold in 
nearest neighbor classifier during MS signal detecting 
phase and the other returned relative squared error 
much higher than 1). 

Heatmap in Figure 6 presents the activities of pepti- 
dases identified for these samples. Hierarchical clustering 
of activity profiles groups samples into clusters being in 
good accordance with patient's diagnosis. The diverse 
serum proteolytic activity for cancer patients and healthy 
donors has been reported in many papers (see e.g. [3]). In 
[16] has been showed that patients exhibit different enzy- 
matic activities than healthy subjects for following pepti- 
dases (identified by our method as well): trypsin, 
cathepsin D and elastase (c.f. Figure 6). We have also 
detected the family of matrix metallopeptidases, whose 
role in cancer development and progression is significant 
[17,18]. Similarly calpain enzyme is used as a marker for 
the early detection of colorectal carcinoma [19] and inhi- 
bitors of cathepsins as possible therapeutics in colorectal 
diseases [20]. Moreover, cathepsins (because of their abil- 
ity to degrade extracellular matrix proteins) have been 
implicated to play a role in invasion and metastasis of 
colorectal cancer. 

We have conducted principal component analysis for 
enzyme activities inferred for 19 samples having smallest 
p-values (c.f.Table 1). The scatterplots in Figure 7 illus- 
trate the outcome of the analysis. Well separation of two 
groups of patients is visible on projection to the plane 
determined by the second and third principal compo- 
nent. Closer look at corresponding loadings suggests the 
crucial role of elastase and cathepsin enzymes in those 
components. 

Conclusions 

In this paper we significantly extend formal model of 
protein degradation proposed in [5]. The extension is 
twofold: firstly current approach encompasses endopep- 
tidase activity as well (while [5] deals with exopeptidases 
only), and secondly we integrate our model with knowl- 
edge about proteolytic events stored in MEROPS data- 
base [6]. Moreover, we formulate the task of inferring 
parameters of our model as constrained optimization 
problem, which we solve by standard procedure for 
non-linear least squares. This approach turned out to be 
more time efficient for complex MS data when compar- 
ing to previous Markov Chain Monte Carlo method (in 
[5] the Metropolis-Hastings algorithm was applied to 
sample parameters from the posterior distribution). 
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Being aware of the problems with quality and reprodu- 
cibility of the LC-MS experiments we selected for 
detailed analysis only a part of accessible data, namely 
those for which the parameter estimation procedure con- 
verges and yields small error. The expected retention 
time for investigated substances is obtained by rather 
unsophisticated approach (i.e. linear regression model), 
which may have impact on the analysis. Preliminary out- 
comes for these samples are very promising: identified 
enzymes are known to play a crucial role in colorectal 
cancer. However, our results are far from any medical 
diagnosis. The proposed method constitutes the proof of 
concept and requires more profound investigations meet- 
ing all clinical standards. We also should discuss here the 
limitations of our methods applied to MS data obtained 
by present technologies. There is a lot of tryptic peptides 
which are not identified by tandem mass spectrometry. 
LC-MS signals corresponding to these peptides and 
theirs degraded forms are missed during cleavage graph 
filling phase. Therefore the inference of proteolytic 
enzymes' activities is based on only partial information 
and could be incomplete as well. However, it is worth to 
noting here that our method would demonstrate its full 
potential while applied to high quality data hopefully 
obtained from the future MS technologies. One direction 
for further development is to focus on cleavage detection 
and to apply recently proposed ice-logo instead of 
sequence logo [21]. Ice-logo contains the information not 
only about residues that are statistically overrepresented 
but also about those, that are underrepresented. This 
approach could be valuable especially in cases, when, as 
in the case of data stored in MEROPS database, we know 
only fraction of all proteolytic events. 
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